EXPLORATORY DATA ANALYSIS.

This report analyses a dataset containing details of fatal police shootings in the US.

About the data:

id: a unique identifier for each victim

name: the name of the victim

date: the date of the fatal shooting

manner_of_death:

  • shot
  • shot and Tasered

armed: indicates that the victim was armed with some sort of implement that a police officer believed could inflict harm

  • undetermined: it is not known whether or not the victim had a weapon
  • unknown: the victim was armed, but it is not known what the object was
  • unarmed: the victim was not armed

age: the age of the victim

gender: the gender of the victim. The Post identifies victims by the gender they identify with if reports indicate that it differs from their biological sex.

  • M: Male
  • F: Female
  • None: unknown

race:

  • W: White, non-Hispanic
  • B: Black, non-Hispanic
  • A: Asian
  • N: Native American
  • H: Hispanic
  • O: Other
  • None: unknown

city: the municipality where the fatal shooting took place. Note that in some cases this field may contain a county name if a more specific municipality is unavailable or unknown.

state: the two-letter postal abbreviation of the state in which the incident took place.

signs_of_mental_illness: News reports have indicated the victim had a history of mental health issues, expressed suicidal intentions or was experiencing mental distress at the time of the shooting.

threat_level: The threat_level column was used to flag incidents for the story. The general criteria for the attack label was that there was the most direct and immediate threat to life. That would include incidents where officers or others were shot at, threatened with a gun, attacked with other weapons or physical force, etc. The attack category is meant to flag the highest level of threat. The other and undetermined categories represent all remaining cases. Other includes many incidents where officers or others faced significant threats.

flee: News reports have indicated the victim was moving away from officers

  • Foot
  • Car
  • Not fleeing

body_camera: News reports have indicated an officer was wearing a body camera and it may have recorded some portion of the incident.

In [97]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import numpy as np
import seaborn as sns
sns.set_theme(style="darkgrid")
import missingno as msno
import sidetable
%matplotlib inline

Importing the libraries.

In [98]:
data=pd.read_csv('fatal-police-shootings-data.csv')

Reading the .csv file that contains this data.

In [99]:
df=data.head()
df
Out[99]:
id name date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera
0 3 Tim Elliot 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing False
1 4 Lewis Lee Lembke 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing False
2 5 John Paul Quintero 2015-01-03 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing False
3 8 Matthew Hoffman 2015-01-04 shot toy weapon 32.0 M W San Francisco CA True attack Not fleeing False
4 9 Michael Rodriguez 2015-01-04 shot nail gun 39.0 M H Evans CO False attack Not fleeing False

This prints the first five entries to give us an idea of the data.

In [100]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5416 entries, 0 to 5415
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       5416 non-null   int64  
 1   name                     5416 non-null   object 
 2   date                     5416 non-null   object 
 3   manner_of_death          5416 non-null   object 
 4   armed                    5189 non-null   object 
 5   age                      5181 non-null   float64
 6   gender                   5414 non-null   object 
 7   race                     4895 non-null   object 
 8   city                     5416 non-null   object 
 9   state                    5416 non-null   object 
 10  signs_of_mental_illness  5416 non-null   bool   
 11  threat_level             5416 non-null   object 
 12  flee                     5167 non-null   object 
 13  body_camera              5416 non-null   bool   
dtypes: bool(2), float64(1), int64(1), object(10)
memory usage: 306.8+ KB

This shows the data type and the non-null count of each column.

In [101]:
data.drop(['name'], axis=1, inplace=True)

Since name is not useful for this EDA, the whole column is dropped.

In [102]:
data.isna().sum()
Out[102]:
id                           0
date                         0
manner_of_death              0
armed                      227
age                        235
gender                       2
race                       521
city                         0
state                        0
signs_of_mental_illness      0
threat_level                 0
flee                       249
body_camera                  0
dtype: int64

This gives the number of empty entries in each column.
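Raw counts are easier to compare as a share of all rows (for example, 521 missing race values out of 5416 rows is about 9.6%). A minimal sketch of that calculation, using a small made-up frame named toy rather than the real file:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
toy = pd.DataFrame({
    "age": [53.0, np.nan, 23.0, 32.0],
    "race": ["A", "W", None, None],
    "state": ["WA", "OR", "KS", "CA"],
})

# isna() yields booleans; the column mean is the fraction of missing entries.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct.sort_values(ascending=False))
```

The same expression should work on data directly, since it only relies on isna().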

In [103]:
sns.heatmap(data.isna(),yticklabels=False,cbar=False,cmap='viridis')
Out[103]:
<AxesSubplot:>
[Figure: heatmap of missing values per column]

The heatmap gives us an idea of where the empty entries occur in each column. Purple cells are filled entries and yellow cells are empty entries.

***We can see that there are empty entries in the following columns:

  1. armed
  2. age
  3. race
  4. gender
  5. flee***
In [104]:
duplicate=data.duplicated()
data[duplicate]
Out[104]:
id date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera

Checking for duplicate rows.

None found.
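Note that id is unique for every row, so a comparison over all columns cannot flag two rows as identical while id is still present. A stricter check might ignore id. The sketch below assumes a tiny made-up frame (toy) and is only one possible way to do this:

```python
import pandas as pd

# Hypothetical rows: the first two differ only in id.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2015-01-02", "2015-01-02", "2015-01-03"],
    "city": ["Shelton", "Shelton", "Wichita"],
    "age": [53.0, 53.0, 23.0],
})

# Comparing every column: the unique id hides the repeated record.
full_row_dupes = toy.duplicated().sum()

# Comparing every column except id: the repeated record is flagged.
cols = [c for c in toy.columns if c != "id"]
content_dupes = toy.duplicated(subset=cols).sum()
print(full_row_dupes, content_dupes)
```

A match on all remaining columns is not proof of a duplicate record, so flagged rows would still need a manual look.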

In [105]:
# Sort the rows by the original id, then replace id with consecutive numbers starting from 1
data = data.sort_values(by=['id']).reset_index(drop=True)
data.drop(['id'], axis=1, inplace=True)
data.insert(0, 'id', data.index + 1)
data.head()
Out[105]:
id date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera
0 1 2015-01-02 shot gun 53.0 M A Shelton WA True attack Not fleeing False
1 2 2015-01-02 shot gun 47.0 M W Aloha OR False attack Not fleeing False
2 3 2015-01-03 shot and Tasered unarmed 23.0 M H Wichita KS False other Not fleeing False
3 4 2015-01-04 shot toy weapon 32.0 M W San Francisco CA True attack Not fleeing False
4 5 2015-01-04 shot nail gun 39.0 M H Evans CO False attack Not fleeing False

The rows are sorted by id, and id is then reset so that it starts from 1.
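One detail worth keeping in mind: sort_values returns a new DataFrame and leaves the original untouched unless the result is assigned back. A small sketch of sorting and renumbering, on a made-up frame named toy:

```python
import pandas as pd

# Hypothetical unsorted frame.
toy = pd.DataFrame({
    "id": [8, 3, 5],
    "city": ["San Francisco", "Shelton", "Wichita"],
})

# Assign the result back, and rebuild the index so it follows the new order.
toy = toy.sort_values(by="id").reset_index(drop=True)

# Replace the original ids with consecutive numbers starting from 1.
toy["id"] = toy.index + 1
print(toy)
```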

In [106]:
data['date_parsed']=pd.to_datetime(data['date'],format = "%Y-%m-%d")
data['date_parsed'].head()
Out[106]:
0   2015-01-02
1   2015-01-02
2   2015-01-03
3   2015-01-04
4   2015-01-04
Name: date_parsed, dtype: datetime64[ns]

The date column is stored with the object (string) dtype. It is parsed with the format yyyy-mm-dd into a new datetime64 column so that pandas recognises the values as dates.
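Once a column is a real datetime, its parts can be reached through the .dt accessor. A minimal sketch, assuming a made-up frame (toy) with three dates, that counts incidents per year:

```python
import pandas as pd

# Hypothetical dates in the same yyyy-mm-dd layout as the dataset.
toy = pd.DataFrame({"date": ["2015-01-02", "2015-01-04", "2016-03-10"]})
toy["date_parsed"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# .dt.year extracts the calendar year; value_counts tallies rows per year.
per_year = toy["date_parsed"].dt.year.value_counts().sort_index()
print(per_year)
```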

In [107]:
data.describe(include='all')
Out[107]:
id date manner_of_death armed age gender race city state signs_of_mental_illness threat_level flee body_camera date_parsed
count 5416.00000 5416 5416 5189 5181.000000 5414 4895 5416 5416 5416 5416 5167 5416 5416
unique NaN 1844 2 93 NaN 2 6 2470 51 2 3 4 2 1844
top NaN 2018-04-01 shot gun NaN M W Los Angeles CA False attack Not fleeing False 2020-05-26 00:00:00
freq NaN 9 5146 3060 NaN 5176 2476 85 799 4200 3495 3411 4798 9
first NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2015-01-02 00:00:00
last NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2020-06-16 00:00:00
mean 2708.50000 NaN NaN NaN 37.117931 NaN NaN NaN NaN NaN NaN NaN NaN NaN
std 1563.60886 NaN NaN NaN 13.116135 NaN NaN NaN NaN NaN NaN NaN NaN NaN
min 1.00000 NaN NaN NaN 6.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% 1354.75000 NaN NaN NaN 27.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% 2708.50000 NaN NaN NaN 35.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% 4062.25000 NaN NaN NaN 46.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
max 5416.00000 NaN NaN NaN 91.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN

Prints a detailed description, including the mean and other summary statistics.

***Observations:

  1. The mean age of the people who died in police shootings is 37.
  2. The most common entry in the armed column is 'gun'.
  3. The city with the most incidents is Los Angeles and the state with the most incidents is California (CA).
  4. Race 'W' accounts for the largest share of incidents.***
In [108]:
print(data.flee.value_counts(),'\n')
print(data.armed.value_counts())
Not fleeing    3411
Car             900
Foot            692
Other           164
Name: flee, dtype: int64 

gun               3060
knife              792
unarmed            353
toy weapon         186
undetermined       164
                  ... 
ice pick             1
Airsoft pistol       1
wasp spray           1
bayonet              1
oar                  1
Name: armed, Length: 93, dtype: int64

Finding the mode of the flee and armed columns.

The mode is the entry that occurs the greatest number of times.
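The mode can be read from the top of value_counts or obtained directly with Series.mode. A short sketch on a made-up series (not the real column):

```python
import pandas as pd

# Hypothetical values, including one missing entry.
flee = pd.Series(["Not fleeing", "Car", None, "Not fleeing", "Foot"])

# value_counts sorts by frequency (descending) and skips missing values.
mode_from_counts = flee.value_counts().index[0]

# mode() returns a Series because ties are possible; take the first entry.
mode_direct = flee.mode()[0]
print(mode_from_counts, mode_direct)
```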

In [109]:
# Fill missing flee and armed values with each column's most frequent value (mode)
data['flee'] = data['flee'].fillna('Not fleeing')
data['armed'] = data['armed'].fillna(data['armed'].value_counts().index[0])
# Drop every row that still has a missing value; inplace=True modifies data itself instead of returning a copy
data.dropna(inplace=True)
age = np.array(data['age'])
age_mean = round(np.nanmean(age), 0)

The empty entries of the flee and armed columns are filled with the mode of each column, and the rows that still contain empty entries are dropped.

The age and race columns are not treated the same way because these factors weigh heavily in the analysis, and filling them with the mode could distort the data. Instead, the rows with a missing age, race or gender are dropped.
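The two treatments can be sketched side by side on a small made-up frame (toy). Here dropna is limited to age and race through subset, which is an assumption for illustration; the cell above drops rows with a missing value in any column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in all four columns.
toy = pd.DataFrame({
    "flee": ["Car", None, "Not fleeing", "Not fleeing"],
    "armed": ["gun", "gun", None, "knife"],
    "age": [30.0, np.nan, 41.0, 25.0],
    "race": ["W", "B", "H", None],
})

# Low-stakes columns: fill gaps with the most frequent value.
for col in ["flee", "armed"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

# High-stakes columns: drop the affected rows instead of guessing a value.
cleaned = toy.dropna(subset=["age", "race"])
print(len(toy), len(cleaned))
```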

In [110]:
data.isna().sum()
Out[110]:
id                         0
date                       0
manner_of_death            0
armed                      0
age                        0
gender                     0
race                       0
city                       0
state                      0
signs_of_mental_illness    0
threat_level               0
flee                       0
body_camera                0
date_parsed                0
dtype: int64

As we can see, there are no more empty entries.

In [111]:
print(data.stb.freq(['race']),"\n")
print(data.stb.freq(['threat_level','flee']),"\n")
print(data.stb.freq(['state'],thresh=55),"\n")
  race  count    percent  cumulative_count  cumulative_percent
0    W   2442  50.779788              2442           50.779788
1    B   1274  26.491994              3716           77.271782
2    H    878  18.257434              4594           95.529216
3    A     91   1.892285              4685           97.421501
4    N     77   1.601164              4762           99.022666
5    O     47   0.977334              4809          100.000000 

    threat_level         flee  count    percent  cumulative_count  \
0         attack  Not fleeing   2079  43.231441              2079   
1          other  Not fleeing   1012  21.043876              3091   
2         attack          Car    527  10.958619              3618   
3         attack         Foot    407   8.463298              4025   
4          other          Car    245   5.094614              4270   
5          other         Foot    193   4.013308              4463   
6   undetermined  Not fleeing    126   2.620087              4589   
7         attack        Other    100   2.079434              4689   
8          other        Other     42   0.873362              4731   
9   undetermined         Foot     34   0.707008              4765   
10  undetermined          Car     34   0.707008              4799   
11  undetermined        Other     10   0.207943              4809   

    cumulative_percent  
0            43.231441  
1            64.275317  
2            75.233936  
3            83.697234  
4            88.791849  
5            92.805157  
6            95.425244  
7            97.504679  
8            98.378041  
9            99.085049  
10           99.792057  
11          100.000000   

     state  count    percent  cumulative_count  cumulative_percent
0       CA    681  14.160948               681           14.160948
1       TX    418   8.692036              1099           22.852984
2       FL    322   6.695779              1421           29.548763
3       AZ    221   4.595550              1642           34.144313
4       CO    165   3.431067              1807           37.575379
5       GA    156   3.243918              1963           40.819297
6       OK    146   3.035974              2109           43.855271
7       OH    145   3.015180              2254           46.870451
8       NC    144   2.994386              2398           49.864837
9       WA    124   2.578499              2522           52.443335
10  others   2287  47.556665              4809          100.000000 

These frequency tables summarise several parameters and parameter combinations to explore how they relate.

***Inference:
Over the period covered by the data (January 2015 to June 2020), among the rows kept after dropping empty entries:

  1. Just over 50% of the victims were of race 'W', more than 26% were of race 'B', and about 18% were of race 'H'.
  2. The largest single group, 43% of all encounters, was marked as attack with the victim not fleeing.
  3. California (CA) is the state with the most encounters at 14%, followed by Texas (TX) at 8.7%.***
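The count, percent and cumulative percent columns can also be rebuilt with plain pandas, which helps when checking such figures by hand. A rough sketch on a made-up race series; it is not how sidetable computes its table internally, only an equivalent calculation:

```python
import pandas as pd

# Hypothetical values standing in for the race column.
race = pd.Series(["W", "W", "B", "H", "W", "B"], name="race")

# Counts per category, most frequent first.
freq = race.value_counts().to_frame("count")

# Share of each category and the running total of those shares.
freq["percent"] = freq["count"] / freq["count"].sum() * 100
freq["cumulative_percent"] = freq["percent"].cumsum()
print(freq.round(2))
```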
In [112]:
sns.catplot(x="age",y="race",hue="gender",orient="h",row="threat_level",kind="box",data=data)
Out[112]:
<seaborn.axisgrid.FacetGrid at 0x14588a00>
[Figure: box plots of age by race and gender, one row per threat level]

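A box plot summarises a distribution compactly. The image above shows the age distribution for each race and gender, with one row per threat level. Each box marks the median and the quartiles, and the whiskers and points show the range and the outliers.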
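The numbers behind each box (quartiles and median) can be computed with groupby and quantile. A minimal sketch on a made-up frame (toy), grouping by race only to keep it short:

```python
import pandas as pd

# Hypothetical ages for two groups.
toy = pd.DataFrame({
    "race": ["W", "W", "W", "B", "B", "B"],
    "age": [20.0, 30.0, 40.0, 25.0, 35.0, 45.0],
})

# One row per group, one column per requested quantile.
summary = toy.groupby("race")["age"].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ["q1", "median", "q3"]

# The interquartile range is the height of the box.
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```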

In [113]:
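# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the suggested replacement
sns.histplot(age, kde=True, stat="density")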
Out[113]:
<AxesSubplot:ylabel='Density'>
[Figure: distribution of victim age]

The above visualisation shows that the age distribution is densest in roughly the 20-35 age group and has a long tail towards older ages.
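The share of incidents in an age band can also be checked numerically instead of being read off the plot. The sketch below uses pd.cut on made-up ages; the bin edges and labels are arbitrary choices for illustration:

```python
import pandas as pd

# Hypothetical ages.
ages = pd.Series([19, 23, 28, 34, 35, 47, 62])

# Right-inclusive bins: (0, 20], (20, 35], (35, 50], (50, 120].
bins = [0, 20, 35, 50, 120]
labels = ["<=20", "21-35", "36-50", "51+"]
bands = pd.cut(ages, bins=bins, labels=labels)

# Fraction of rows in each band, kept in bin order.
share = bands.value_counts(normalize=True).sort_index()
print(share)
```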

In [114]:
sns.catplot(x="threat_level",hue="signs_of_mental_illness",kind="count",data=data)
Out[114]:
<seaborn.axisgrid.FacetGrid at 0x294403a0>
[Figure: count of incidents by threat level, split by signs of mental illness]

This shows, for each threat level, the number of incidents with and without signs of mental illness.
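The same comparison can be put into numbers with a cross-tabulation. A minimal sketch on a made-up frame (toy); normalising by row gives the share of each threat level with and without signs of mental illness:

```python
import pandas as pd

# Hypothetical rows using the two column names from the dataset.
toy = pd.DataFrame({
    "threat_level": ["attack", "attack", "other", "other", "attack"],
    "signs_of_mental_illness": [True, False, False, True, False],
})

# normalize="index" makes each row sum to 1.
table = pd.crosstab(
    toy["threat_level"], toy["signs_of_mental_illness"], normalize="index"
)

# Columns come out sorted as False, True; rename them for readability.
table.columns = ["no_signs", "signs"]
print(table.round(2))
```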